ABSTRACT

We propose a novel mechanism for answering sets of count- 
ing queries under differential privacy. Given a workload of 
counting queries, the mechanism automatically selects a dif- 
ferent set of "strategy" queries to answer privately, using 
those answers to derive answers to the workload. The main 
algorithm proposed in this paper approximates the optimal 
strategy for any workload of linear counting queries. With no 
cost to the privacy guarantee, the mechanism improves significantly
on prior approaches and achieves near-optimal error for many
workloads, when applied under (ε, δ)-differential
privacy. The result is an adaptive mechanism which can help 
users achieve good utility without requiring that they reason 
carefully about the best formulation of their task. 

1. INTRODUCTION 

Differential privacy [10] guarantees that information released
about participants in a data set will be virtually indistinguishable
whether or not their personal data is included.
There are now many algorithms satisfying differential privacy [s];
however, when adopting differential privacy, users
must reason carefully about alternative mechanisms and the 
formulation of their task. Their choices may have a signifi- 
cant impact on the utility of the output, for the same level 
of privacy. Even using the PINQ framework [16], designed
to aid uninitiated users in writing differentially-private pro- 
grams, users can be faced with vastly different degrees of 
accuracy depending on how their task is expressed. 

Further, there are few results showing that proposed algorithms
are optimally accurate, that is, that they introduce the least
possible distortion required to satisfy the privacy criterion.^1
Thus, if the utility they achieve is unacceptable, users often do
not know if better utility is possible with a different algorithm,
or if their utility goals are fundamentally incompatible with
differential privacy.

^1 For a single numerical query, the addition of appropriately-
scaled discrete Laplace noise satisfies ε-differential privacy
and has been proven optimally accurate [11]. For workloads
of multiple queries, optimally accurate mechanisms are not
known.

In this work, we attempt to relieve the user of some of 
these difficulties by developing a mechanism that automati- 
cally adapts to the set of submitted queries and provides sig- 
nificantly improved utility over competing approaches. We 
focus on batch query answering, in which a set of queries is 
answered at one time, in a single interaction with the pri- 
vate server. We call the set of queries a workload, which 
we allow to be any collection of linear counting queries. 
This general class of queries can be used to express his- 
tograms, marginals, data cubes, empirical cumulative distri- 
bution functions, common aggregation queries with group- 
ing, and more. 

One of the motivations for considering batch query answering
of large workloads is to avoid the complications of
online mechanisms in which a user must carefully manage 
their privacy budget, and, in addition, multiple users may be 
required to share a single privacy budget to avoid a breach 
of the privacy definition resulting from collusion. It is there- 
fore appealing to structure large workloads that contain the 
sufficient statistics of a data mining task, or which can si- 
multaneously support the intended tasks of a group of users. 
In fact, the output of our algorithms can often be treated as 
a synthetic data set, albeit one which is tailored specifically 
for accuracy on the queries in the given workload. 

The standard approach for answering a workload of queries 
under ε-differential privacy is the Laplace mechanism, which
adds to each query a sample chosen independently at ran- 
dom from a Laplace distribution. The noise distribution is 
scaled to the sensitivity of the workload: the maximum pos- 
sible change to the query answers induced by the addition 
or removal of one tuple. Large workloads often have high 
sensitivity, in which case the Laplace mechanism results in 
extremely noisy query answers because the noise added to 
each query in the workload is proportional to the sensitivity 
of the workload. 

Recently, a number of related approaches have been pro- 
posed which improve on the Laplace mechanism, sometimes 
allowing for low error where only unacceptably high error 
was possible before. They each embody a basic (but per- 
haps counter-intuitive) principle: better results are possible 
when you don't ask for what you want. 

The earliest example of this approach focuses on work- 
loads consisting of sets of k-way marginals, for which Barak 
et al. answer a set of Fourier basis queries using the Laplace 
mechanism, and then derive the desired marginals [2]. For 
workloads consisting of all range-count queries over an ordered
domain, two approaches have been proposed. Xiao et al. [21] first
answer a set of wavelet basis queries, while Hay et al. [13] use a
hierarchical set of counting queries which recursively decompose
the domain. For workloads consisting of sets of marginals, Ding
et al. [7] recently proposed a method for selecting an alternative
set of marginals, from which the desired counts can be derived.

These techniques can each be described in the framework 
of the recently-proposed matrix mechanism. Given a
workload of queries, the matrix mechanism uses the Laplace
mechanism to answer a set of strategy queries. The answers 
to the strategy queries are then used to derive answers to 
the workload queries by finding a solution that minimizes 
squared error. (The derivation by least squares is implicit in
Barak [2] and Xiao [21], but explicit in Hay [13] and Ding
[7].) In these terms, the four approaches described above can
each be seen as providing a set of strategy queries suitable 
for a particular kind of workload. Ultimately, the use of 
the strategy queries and the derivation process result in a 
more complex, non-independent noise distribution which can 
reduce error. 

The matrix mechanism makes clear that nearly any set 
of strategy queries can be used in this manner to answer 
a workload. Effective strategies have lower sensitivity than 
the workload, and are such that the workload queries can be 
concisely represented in terms of the strategy queries. But 
the approach remains limited to specific strategies for range
queries [13, 21], and approaches which provide only limited
choices of strategies for marginals [2, 7].

We continue this line of work in order to create a truly 
adaptive mechanism that can answer a wide range of work- 
loads with low error. The key to such a mechanism is strat- 
egy selection: the problem of computing the set of strategy 
queries that minimizes error for a given workload. Unfortu- 
nately, exact solutions to the strategy selection problem are 
infeasible in practice [14]. One of our main contributions is
an approximation algorithm capable of efficiently computing
a nearly optimal strategy in $O(n^4)$ time (where n is the
number of individual counting queries required to express 
the workload). The result is a mechanism that adapts the 
noise distribution to the set of queries of interest, relieving 
the user of the burden of choosing among mechanisms or 
carefully analyzing their workload. 

A few main insights underlie our contributions. First, we
shift our focus to (ε, δ)-differential privacy, a modest relaxation
of ε-differential privacy. The standard mechanism in this case is
the Gaussian mechanism, which suffers the same limitations as the
Laplace mechanism, and is also improved by the same approaches
described above. The important difference for our results is that
sensitivity is measured using the $L_2$ metric (instead of $L_1$),
which ultimately allows for better approximate solutions.^2 Second,
inspired by the statistical problem of optimal experimental design
[5, 18], we formulate the strategy selection problem as a convex
optimization problem which chooses n coefficients to serve as
weights for a fixed set of design queries. Third, we show that the
eigenvectors of the workload (when represented in matrix form)
capture the essential building blocks required for near-optimal
strategies and are therefore a very effective choice for the design
queries underlying the above optimization problem.

^2 Our algorithm can also be adapted to ε-differential privacy,
but it is less efficient, appears to be less effective, and is
significantly harder to analyze. (Please see Sec. 3.5.)

Our adaptive mechanism advances the state-of-the-art in 
terms of accuracy, under both absolute and relative measures 
of error: 

- For workloads targeted by prior approaches, our algorithm
automatically computes strategies with uniformly lower error.
For marginals, our error can be reduced by as much as 6.2 times
over Barak and 3.2 times over Ding. For range queries, our error
is reduced by as much as 2.6 times over Xiao and 2.7 times over Hay.

- The power of our adaptive approach is most obvious 
when applying the mechanism to ad hoc workloads 
(which may result from specializing a larger workload 
to a given task, or by combining workloads from mul- 
tiple users). Error is reduced by as much as 13 times 
over alternative techniques. 

- Our algorithm has a provable approximation ratio and 
produces strategies with near optimal absolute error 
for many workloads of interest. We never witness an 
approximation rate greater than 1.3 times the optimal 
absolute error. For workloads of marginals, error rates 
consistently match the optimal achievable error rates. 

Our mechanism is also significantly more general than 
prior work. It can be applied to any workload of linear count- 
ing queries: a much larger class of queries than marginals or 
range queries. In addition, the algorithm avoids a subtle
limitation of some previous approaches [13, 21, 7] in which
achieving promised error rates depends on finding a proper 
representation for the workload. 

Throughout the paper, all improvements to accuracy are 
made with absolutely no cost to privacy: accuracy is im- 
proved by constructing a better noise distribution satisfy- 
ing the same privacy condition. In addition, while strategy 
selection is the most computationally intensive part of the 
mechanism, it only needs to be performed once for any work- 
load, and need not be recomputed to re-run the mechanism 
on a new database instance. Once the selected strategy is 
preprocessed, the complexity of executing the mechanism is 
no higher than applying the standard Laplace mechanism to 
the workload. 

The paper is organized as follows. We review definitions
and formally describe the matrix mechanism in Sec. 2. Our
algorithm is presented in Sec. 3 along with a theoretical
analysis that establishes the approximation rate and other
properties. In Sec. 4 we propose performance optimizations
which significantly improve computation time with minimal
impact on solution quality. In Sec. 5 we evaluate both
absolute and relative error rates of our mechanism on a range
of workloads. We discuss related work and conclude in Sec. 6
and Sec. 7.

2. BACKGROUND 

In this section we describe our data model and privacy 
conditions. We also review the fundamentals of the matrix 
mechanism, including error measurement and the problem 
of strategy selection. Throughout the paper, we use the no- 
tation of linear algebra and employ standard techniques of 
matrix analysis. For a matrix A, $A^T$ is its transpose and
trace(A) is the sum of values on the main diagonal. If A is
a square matrix with full rank, $A^{-1}$ denotes its inverse. We
use $\mathrm{diag}(c_1, \ldots, c_n)$ to indicate an $n \times n$
diagonal matrix with scalars $c_i$ on the diagonal.

2.1 Data Model, Linear Queries, and Workloads 

The workloads considered in this paper consist of counting
queries over a single relation. Let the database I be an
instance of a single-relation schema R(A), with attributes
$A = \{A_1, A_2, \ldots, A_k\}$. The crossproduct of the attribute
domains, written dom(A), is the set of all possible tuples
that may occur in I.

In order to express our queries, we first transform the instance
I into a data vector x of cell counts. We may choose to fully
represent instance I by defining the vector x with one cell for
every element of dom(A). Then x is a bit vector of size |dom(A)|
with nonzero counts for each tuple present in I. This is often
inefficient (the size of the x vector is the product of the
attribute domain sizes) and ineffective (the base counts are
typically too small to be estimated accurately under the privacy
condition). A common way to form a vector of base counts over
larger cells is to partition each $dom(A_i)$ into $d_i$ regions,
which could correspond to ranges over an ordered domain or
individual elements (or sets of elements) in a categorical domain.
Then the individual cells are defined by taking the cross-product
of the regions in each attribute. The choice of cells in the data
vector is ultimately determined by the workload queries that need
to be expressed.

To formally define the data vector we associate, with each
element $x_i$ of x, a Boolean cell condition $\phi_i$, which
evaluates to True or False for any tuple in dom(A). We always
require that the cell conditions be pairwise unsatisfiable: any
tuple in dom(A) will satisfy exactly one cell condition. Then
$x_i$ is defined to be the count of the tuples from I which
satisfy $\phi_i$.

Definition 1 (Data vector). Given an ordered list of
cell conditions $\phi_1, \phi_2, \ldots, \phi_n$, the data vector x
is a length-n column vector defined by n positive integral counts
$x_i = |\{t \in I \mid \phi_i(t) \text{ is True}\}|$.

In the sequel, the length of x is a key parameter, always 
denoted by n. 

Example 1. Consider the relational schema R(name,
gradyear, gender, gpa) describing students. If we wish to
form queries only over gender (Male or Female), and gpa
ranges [1.0,2.0), [2.0,3.0), [3.0,3.5), [3.5,4.0), then we can
define the 8 cell conditions enumerated in Fig. 1(a).

A linear query computes a specified linear combination of 
the elements of the data vector x. 

Definition 2 (Linear query). A linear query is a
length-n row vector $q = [q_1 \ldots q_n]$ with each
$q_i \in \mathbb{R}$. The answer to a linear query q on x is the
vector product $qx = q_1 x_1 + \cdots + q_n x_n$.

In addition to basic predicate counting queries, other aggregates
like sum and average, as well as group-by queries, can
be expressed as linear counting queries. A workload is a set of
linear queries. A workload is represented as a matrix, where
each row is a single linear counting query.

Definition 3 (Query matrix). A query matrix is a
collection of m linear queries, arranged by rows to form an
$m \times n$ matrix.



If W is an $m \times n$ query matrix, the query answer for W
is a length-m column vector of query results, which can be
computed as the matrix product Wx. Note that cell condition
$\phi_i$ defines the meaning of the i-th position of x, and
accordingly, it determines the meaning of the i-th column of
a query matrix.
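
The construction of the data vector and the evaluation of a query
matrix can be illustrated with a minimal sketch. The student tuples,
the ordering of the cells, and the three-query matrix below are
hypothetical choices made for illustration; they are not the contents
of Fig. 1.

```python
import numpy as np

# Hypothetical toy instance I over (gender, gpa); values are illustrative only.
students = [("F", 3.7), ("M", 2.4), ("F", 3.1), ("M", 3.9), ("F", 1.8)]

genders = ["M", "F"]
gpa_ranges = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.5), (3.5, 4.0)]

# Cell conditions: cross-product of gender values and gpa ranges (8 cells).
# The ordering (all M cells first, then all F cells) is an assumption.
cells = [(g, lo, hi) for g in genders for (lo, hi) in gpa_ranges]

def cell_index(gender, gpa):
    """Return the index of the unique cell condition the tuple satisfies."""
    for i, (g, lo, hi) in enumerate(cells):
        if gender == g and lo <= gpa < hi:
            return i
    raise ValueError("tuple satisfies no cell condition")

# Data vector x: x[i] counts the tuples of I that satisfy cell condition i.
x = np.zeros(len(cells), dtype=int)
for gender, gpa in students:
    x[cell_index(gender, gpa)] += 1

# A small query matrix W, one linear query per row.
W = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],   # all students
    [1, 1, 1, 1, 0, 0, 0, 0],   # male students (cells 0..3)
    [0, 0, 0, 0, 1, 1, 1, 1],   # female students (cells 4..7)
])

answers = W @ x   # true (non-private) answers, a length-m vector
```

Because every tuple falls in exactly one cell, the entries of x sum to
the number of tuples, and each row of W simply adds up the cells it
covers.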

Example 2. Fig. 1(b) shows a query matrix representing
a workload of 8 linear queries. Fig. 1(c) describes the meaning
of the queries w.r.t. the cell conditions in Fig. 1(a).

Note that the data analyst should include in the work- 
load all queries of interest, even if some queries could be 
computed from others in the workload. In the absence of 
noise introduced by the privacy mechanism, it might be rea- 
sonable for the analyst to request answers to a small set of 
counting queries, from which other queries of interest could 
be computed. (E.g., it would be sufficient to recover x itself 
by choosing the workload defined by the identity matrix.) 
But because the analyst will receive private, noisy estimates 
to the workload queries, the error of queries computed from 
their combination is often increased. Our adaptive mecha- 
nism is designed to optimize error across the entire set of 
desired queries, so all queries should be included. As a concrete
example, in Fig. 1(b), $q_3$ can be computed as $(q_1 - q_2)$
but is nevertheless included in the workload.

We introduce terminology for a few common workloads 
used throughout the paper. The relevant properties of work- 
loads are reflected by their matrix representation, so we often 
drop explicit mention of the schema and attributes involved 
and focus simply on the number of distinct attributes and the 
number of disjoint buckets for each attribute, assuming that 
cells are formed uniformly in the manner described above. 

We consider predicate queries, range queries and k-way
marginal queries. In addition, since each k-way marginal
query covers a single value on the margin, one may need to
sum answers to multiple marginal queries in order to answer
any range query on the margin. When the answers to the
marginal queries have noise, summing introduces too much
accumulated noise. Therefore, in this paper, we also consider
k-way range marginal queries, each of which aggregates multiple
k-way marginal queries so as to cover a range on the
margin.

Example 3. Of the queries in Fig. 1, the first seven
are range queries (and therefore predicate queries as well).
$q_1, \ldots, q_5$ are one-way range marginal queries, in which
$q_1$, $q_2$, $q_3$ are one-way range marginal queries over gender
and $q_1$, $q_4$, $q_5$ are one-way range marginal queries over gpa;
$q_2$, $q_3$ are also one-way marginal queries.

We often consider large workloads consisting of all queries
of a given type, such as "all predicate", "all range", "all k-way
range marginal", "all k-way marginal" and "all marginal"
(the union of all k-way marginals for $0 \le k \le m$). Notice
there is no workload of "all range marginal" since it is equivalent
to "all range". Later we will also consider ad hoc workloads
consisting of arbitrary subsets of each of these types
of queries and their combinations. In practice such work- 
loads are important because they may arise from combining 
queries of interest to multiple users, or from specializing a 
general workload to a more specific task, to improve error. 

Figure 1: For schema R(name, gradyear, gender, gpa), (a) shows 8
cell conditions on attributes gender and gpa. The database vector x
(not shown) will accordingly consist of 8 counts; (b) shows a sample
workload matrix W consisting of 8 queries, each described in (c).
[Panels: (a) Cell conditions $\phi$; (b) A query workload W;
(c) Counting queries defined by rows of W]

2.2 Differential Privacy and Gaussian Noise

Standard ε-differential privacy places a bound (controlled by ε)
on the difference in the probability of query answers for any two
neighboring databases. For database instance I, we denote by
nbrs(I) the set of databases differing from I in at most one
record. Approximate differential privacy [9, 17] is a modest
relaxation in which the ε bound on query answer probabilities may
be violated with small probability (controlled by δ).

Definition 4. (Approximate Differential Privacy)
A randomized algorithm $\mathcal{K}$ is (ε, δ)-differentially
private if for any instance I, any $I' \in nbrs(I)$, and any subset
of outputs $S \subseteq Range(\mathcal{K})$, the following holds:

$$\Pr[\mathcal{K}(I) \in S] \le \exp(\epsilon) \times \Pr[\mathcal{K}(I') \in S] + \delta$$

When δ = 0, the condition is standard ε-differential privacy.

Both definitions can be satisfied by adding random noise
to query answers. The magnitude of the required noise is
determined by the sensitivity of a set of queries: the maximum
change in a vector of query answers over any two neighboring
databases. But the two privacy definitions differ in the
measurement of sensitivity and in their noise distributions.
Standard differential privacy can be achieved by adding Laplace
noise calibrated to the $L_1$ sensitivity of the queries [10].
Approximate differential privacy can be achieved by adding
Gaussian noise calibrated to the $L_2$ sensitivity of the queries
[9, 17]. Our main results focus on approximate differential
privacy, but we discuss extensions to standard differential
privacy in Sec. 3.5.

Our query workloads are represented as matrices, so we express
the sensitivity of a workload as a matrix norm. Because
neighboring databases I and I' differ in exactly one tuple,
and because cell conditions are disjoint, it follows that the
corresponding vectors x and x' differ in exactly one component,
by exactly one, in which case we write $x' \in nbrs(x)$.
The $L_2$ sensitivity of W is equal to the maximum $L_2$ norm
of the columns of W. Below, cols(W) is the set of column
vectors $W_i$ of W.

Proposition 1 ($L_2$ Query matrix sensitivity). The
$L_2$ sensitivity of a query matrix W is denoted $\|W\|_2$, defined
as follows:

$$\|W\|_2 = \max_{x' \in nbrs(x)} \|Wx - Wx'\|_2 = \max_{W_i \in cols(W)} \|W_i\|_2$$

For example, for W in Fig. 1(b), we have $\|W\|_2 = \sqrt{5}$.
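
As a small illustration of Proposition 1, the sketch below computes
the $L_2$ sensitivity as the largest column norm. The 3 x 4 matrix is
a hypothetical workload, not the one in Fig. 1(b), and the function
name `l2_sensitivity` is ours.

```python
import numpy as np

def l2_sensitivity(W):
    """Maximum L2 norm over the columns of query matrix W."""
    return np.linalg.norm(W, axis=0).max()

# Hypothetical workload over 4 cells (one query per row).
W = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
], dtype=float)

# Column norms are sqrt(2), sqrt(3), sqrt(2), 1, so the maximum is sqrt(3).
sens = l2_sensitivity(W)
```

Adding or removing one tuple changes a single entry of x by one, so the
change in Wx is exactly one column of W, which is why the column norms
determine the sensitivity.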

The classic differentially private mechanism adds independent
noise calibrated to the sensitivity of a query workload.
We use $Normal(\sigma)^m$ to denote a column vector consisting of
m independent samples drawn from a Gaussian distribution
with mean 0 and scale σ.



Proposition 2. (Gaussian mechanism [9, 17]) Given
an $m \times n$ query matrix W, the randomized algorithm
$\mathcal{G}$ that outputs the following vector is
(ε, δ)-differentially private:

$$\mathcal{G}(W, x) = Wx + Normal(\sigma)^m$$

where $\sigma = \|W\|_2 \sqrt{2\ln(2/\delta)}/\epsilon$.

Recall that Wx is a vector of the true answers to each
query in W. The algorithm above adds independent Gaussian
noise (scaled by ε, δ, and the sensitivity of W) to each
query answer. Thus $\mathcal{G}(W, x)$ is a length-m column vector
containing a noisy answer for each linear query in W.
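
A minimal sketch of the Gaussian mechanism as stated in Proposition 2
follows. The function names, the example workload, and the data vector
are illustrative assumptions; only the noise scale formula is taken
from the proposition.

```python
import numpy as np

def gaussian_sigma(sensitivity, eps, delta):
    """Noise scale from Proposition 2: sens * sqrt(2 ln(2/delta)) / eps."""
    return sensitivity * np.sqrt(2.0 * np.log(2.0 / delta)) / eps

def gaussian_mechanism(W, x, eps, delta, rng):
    """Return W x plus independent Gaussian noise scaled to ||W||_2."""
    sens = np.linalg.norm(W, axis=0).max()        # L2 sensitivity of W
    sigma = gaussian_sigma(sens, eps, delta)
    return W @ x + rng.normal(0.0, sigma, size=W.shape[0])

# Hypothetical workload and data vector.
W = np.array([[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
x = np.array([10.0, 20.0, 5.0, 15.0])

rng = np.random.default_rng(0)
noisy = gaussian_mechanism(W, x, eps=0.5, delta=1e-4, rng=rng)
```

Every query receives noise of the same scale, which is driven by the
worst-case column of W rather than by the individual query.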

2.3 The Matrix Mechanism 

The matrix mechanism has a form similar to the Gaussian 
mechanism but adds a more complex noise vector. It uses a 
strategy matrix, A, to construct this vector. 

Proposition 3. ((ε, δ)-Matrix Mechanism [14]) Given
an $m \times n$ query matrix W, and assuming A is a full rank
$p \times n$ strategy matrix, the randomized algorithm
$\mathcal{M}_A$ that outputs the following vector is
(ε, δ)-differentially private:

$$\mathcal{M}_A(W, x) = Wx + WA^{+} Normal(\sigma)^p$$

where $\sigma = \|A\|_2 \sqrt{2\ln(2/\delta)}/\epsilon$.

Here $A^{+}$ is the pseudo-inverse of A: $A^{+} = (A^T A)^{-1} A^T$;
if A is a square matrix, then $A^{+}$ is just the inverse of A. The
intuitive justification for this mechanism is that it is equivalent
to the following three-step process: (1) the queries in the
strategy are submitted to the Gaussian mechanism; (2) an estimate
$\hat{x}$ for x is derived by computing the $\hat{x}$ that minimizes
the squared sum of errors (this step consists of standard linear
regression and requires that A be full rank to ensure a
unique solution); (3) noisy answers to the workload queries
are then computed as $W\hat{x}$. The answers to W derived in
step (3) are always consistent because they are computed
from a single noisy version of the cell counts, $\hat{x}$.
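
The three-step process can be sketched as follows. This is an
illustrative implementation under stated assumptions: the function
name, the hierarchical strategy over four cells, and the data vector
are hypothetical, and the least squares step uses NumPy's generic
solver rather than any preprocessing described later in the paper.

```python
import numpy as np

def matrix_mechanism(W, A, x, eps, delta, rng):
    """Answer workload W via strategy A (A must have full column rank)."""
    sens_A = np.linalg.norm(A, axis=0).max()                # ||A||_2
    sigma = sens_A * np.sqrt(2.0 * np.log(2.0 / delta)) / eps
    # (1) Gaussian mechanism applied to the strategy queries.
    y = A @ x + rng.normal(0.0, sigma, size=A.shape[0])
    # (2) Least squares estimate of the data vector.
    x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    # (3) Workload answers derived from the single estimate x_hat.
    return W @ x_hat

# Hypothetical workload: total, left half, right half of 4 cells.
W = np.array([[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
# Hypothetical hierarchical strategy: total, two halves, four leaves.
A = np.vstack([np.ones((1, 4)),
               [[1, 1, 0, 0], [0, 0, 1, 1]],
               np.eye(4)])
x = np.array([10.0, 20.0, 5.0, 15.0])

rng = np.random.default_rng(0)
noisy = matrix_mechanism(W, A, x, eps=0.5, delta=1e-4, rng=rng)
```

Because all answers come from one estimate of the cell counts, the
noisy total equals the sum of the two noisy halves, which illustrates
the consistency property mentioned above.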

Like the Gaussian mechanism, the matrix mechanism computes
the true answer vector Wx and adds noise to each
component. But a key difference is that the scale of the
Gaussian noise is calibrated to the sensitivity of the strategy
matrix A, not that of the workload. In addition, the noise
added to query answers is no longer independent, because
the vector of independent Gaussian samples is transformed
by the matrix $WA^{+}$.

Example 4. The three strategy matrices in Fig. 2 can be
used by the matrix mechanism to answer the workload W in
Fig. 1(b), with differing results. The first strategy is the
identity matrix, the second is the Haar wavelet strategy, and the
third is the output of the algorithm proposed in Sec. 3. With
ε = 0.5 and δ = 0.0001, if the workload itself is used as the
strategy, the root mean square error of answering W is 47.78.
The root mean square error using the identity, wavelet, and
adaptive strategies is 45.36, 34.62 and 29.79, respectively. It
is possible to prove that no strategy can answer W with error
less than 29.18, so the algorithm is finding a nearly optimal
strategy for this workload.

Figure 2: Alternative strategy matrices that can
be used to answer workload W from Fig. 1(b). The
root mean square error of answering W when using
the identity, wavelet, and adapted strategies is 45.36,
34.62 and 29.79, respectively.
[Panels: Identity; Wavelet; Adaptive Algorithm Output]

Intuitively, by using the identity strategy, we get noisy esti- 
mates of each cell count using the Gaussian mechanism, and 
then use those estimates to compute the workload queries. 
This strategy performs poorly for workload queries that sum 
many base counts because the variance of the independent 
noise increases additively. The wavelet addresses this lim- 
itation by allowing large range queries to be estimated by 
combining the answers to just a few of the wavelet strategy 
queries. It offers a dramatic improvement over the identity 
strategy for workloads consisting of all range queries. However,
the wavelet is not necessarily appropriate for every
workload. Our algorithm produces a strategy customized to 
W, allowing for reduced error. 

2.4 Optimal Error for the Matrix Mechanism 

We measure the accuracy of a noisy query answer using 
root mean square error. For a workload of queries, the error 
is defined as the root mean square error of the vector of 
answers, which we refer to simply as workload error in the 
remainder of the paper. 

Definition 5 (Query and Workload Error). Let
$\hat{w}$ be the estimate for query w under the matrix mechanism
using query strategy A. That is, $\hat{w} = \mathcal{M}_A(w, x)$.
The query error of the estimate for w using strategy A is:

$$\mathrm{Error}_A(w) = \sqrt{E[(wx - \hat{w})^2]}.$$

Given a workload W consisting of m queries, the workload
error of answering W using strategy A is:

$$\mathrm{Error}_A(W) = \sqrt{\frac{1}{m} \sum_{w_i \in W} \mathrm{Error}_A(w_i)^2}$$



The query answers returned by the matrix mechanism are 
linear combinations of noisy strategy query answers to which 
independent Gaussian noise has been added. Thus, as the 
following proposition shows, we can directly compute the 
error for any linear query w or workload W as a function of 
ε, δ, and A:



Proposition 4. (Workload Error) Given a workload
W, the error of answering W using the (ε, δ) matrix mechanism
with query strategy A is:

$$\mathrm{Error}_A(W) = \|A\|_2 \sqrt{P(\epsilon, \delta)\, \mathrm{trace}(W^T W (A^T A)^{-1})} \qquad (1)$$

where $P(\epsilon, \delta) = \frac{2\ln(2/\delta)}{m\,\epsilon^2}$.
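
The closed form can be evaluated directly, as in the sketch below. It
assumes the root mean square convention (division by the number of
queries m) and the noise scale of Proposition 3; the function name and
the small "all range" workload over four cells are hypothetical and
are not the workload of Fig. 1(b).

```python
import numpy as np

def workload_error(W, A, eps, delta):
    """RMS workload error of the matrix mechanism with strategy A.

    Assumes A has full column rank so that A^T A is invertible.
    """
    m = W.shape[0]
    sens_A = np.linalg.norm(A, axis=0).max()                # ||A||_2
    sigma = sens_A * np.sqrt(2.0 * np.log(2.0 / delta)) / eps
    gram_inv = np.linalg.inv(A.T @ A)
    return sigma * np.sqrt(np.trace(W.T @ W @ gram_inv) / m)

# Hypothetical workload: all range queries over n = 4 ordered cells.
n = 4
W = np.array([[1.0 if lo <= j <= hi else 0.0 for j in range(n)]
              for lo in range(n) for hi in range(lo, n)])

err_identity = workload_error(W, np.eye(n), eps=0.5, delta=1e-4)
err_workload = workload_error(W, W, eps=0.5, delta=1e-4)
```

Note that the data vector x does not appear anywhere in the
computation, and that rescaling a strategy by a constant leaves the
error unchanged, since the sensitivity and the inverse Gram matrix
scale in opposite directions.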

To build a mechanism that adapts to a given workload 
W, our goal is to select a strategy A to minimize the above 
formula. The optimal strategy for a workload W is defined 
to be one that minimizes the workload error: 

Problem 1. (Optimal Strategy Selection) Given a
workload W, find a query strategy $A_0$ such that:

$$\mathrm{Error}_{A_0}(W) = \min_A \mathrm{Error}_A(W). \qquad (2)$$

We denote the problem of computing an optimal strategy
matrix as OptStrat(W) and the workload error under this
strategy as OptError(W). It is possible to compute an exact
solution to OptStrat(W) by representing it as a convex
optimization problem [14]. However, encoding the necessary
constraints results in a problem with a large number of
variables and optimization takes $O(n^8)$ time with standard
solvers, making it infeasible for practical applications. One
of our main goals is to efficiently find approximately optimal 
strategy matrices, for any provided workload. 

We emphasize that the algorithms in this paper optimize 
the workload error, an absolute measure of error. The solu- 
tion to this optimization problem depends on the workload 
alone, not on the input database. (This is evident from the
fact that x, the vector of database counts, does not appear in
Eq. 1 above.) We also consider relative error in the experimental
evaluation, which inherently depends on the input
database. We show that low relative error for a workload W 
can be achieved by optimizing (absolute) error of a workload 
whose rows have been scaled in a straightforward way. 

3. AN ALGORITHM FOR EFFICIENT 
STRATEGY SELECTION 

In this section we present an approximation algorithm for 
the strategy selection problem, prove its approximation rate 
and other properties, and discuss adapting the algorithm to 
ε-differential privacy.

3.1 Optimal Query Weighting 

The main difficulty in solving OptStrat(W) is computing 
(subject to complex constraints) all entries of a strategy 
matrix. To simplify the problem, we take inspiration from 
the related problem of optimal experimental design.

Consider a scientist who wishes to estimate the value of n
unknown variables as accurately as possible. The variables 
cannot be observed directly, but only by running one or more 
of a fixed set of feasible experiments, each of which returns a 
linear combination of the variables. The experiments suffer 
from observational error, but those errors are assumed inde- 
pendent, and it follows that the least square method can be 
used to estimate the unknown variables once the results of 
the experiments are collected. Each experiment has an as- 
sociated cost (which may represent time, effort, or financial 
expense) and the scientist has a fixed budget. The optimal 
experimental design is the subset (or weighted subset) of fea- 
sible experiments offering the best estimate of the unknown 
variables and with a cost less than the budget constraint. 



There is an immediate analogy to the problem of strategy 
selection: our strategy queries are like experiments that pro- 
vide partial information about the unknown data vector x, 
and the final result will be computed using the least square 
method. However, in our setting, we are permitted to ask 
any query, with a cost (arising from the increase in sensitiv- 
ity) which impacts the added noise. In addition, our goal is 
to minimize the workload error, while experimental design 
always minimizes the error of the individual variables (i.e. 
the error metric in experimental design is equivalent to our 
problem only if W is the identity matrix). 

Despite these important differences, we adopt from exper- 
imental design the idea to limit the selection of our strategy 
to weighted combinations of a set of design queries that are 
fixed ahead of time. Naturally, design queries with a weight
of zero are omitted. For a set of design queries Q, the following
problem, denoted $\mathrm{OptStrat}_Q(W)$, selects the set of
weights which minimizes the workload error for W.

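Problem 2 (Approximate Strategy Selection).
Let W be a workload and $Q = \{q_1, \ldots, q_k\}$ the design queries.
For weights $\Lambda = (\lambda_1 \ldots \lambda_k) \in \mathbb{R}^k$,
let matrix $A_{\Lambda,Q} = [\lambda_1 q_1, \ldots, \lambda_k q_k]^T$.
Choose weights $\Lambda_0 \in \mathbb{R}^k$ such that:

$$\mathrm{Error}_{A_{\Lambda_0,Q}}(W) = \min_{\Lambda \in \mathbb{R}^k} \mathrm{Error}_{A_{\Lambda,Q}}(W). \qquad (3)$$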

The solution to this problem only approximates the truly
optimal strategy since it is limited to selecting a strategy
that is a weighted combination of the design queries. But
$\mathrm{OptStrat}_Q(W)$ can be computed much more efficiently than
OptStrat(W). To do so, we describe $\mathrm{OptStrat}_Q(W)$ as
a semi-definite program [H], a special form of convex
optimization in which a linear objective function is minimized
over the cone of positive semidefinite matrices. Below, $\circ$
is the Hadamard (entry-wise) product of two matrices, and
for symmetric matrix Q, $Q \succeq 0$ denotes that Q is positive
semidefinite, which means $x^T Q x \ge 0$ for any vector x.



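Program 1 Optimal Query Weighting

Given:      $c_1, \ldots, c_n$, $Q = [q_1, \ldots, q_n]$.
Choose:     $u_1, \ldots, u_n$, $v_1, \ldots, v_n$.
Minimize:   $c_1 v_1 + \cdots + c_n v_n$.
Subject to: $\begin{bmatrix} u_i & 1 \\ 1 & v_i \end{bmatrix} \succeq 0$, $i = 1, \ldots, n$.
            $(Q \circ Q)^T [u_1, \ldots, u_n]^T \le 1$.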



Theorem 1. Given a workload W and a set of design
queries $Q = \{q_1, \dots, q_n\}$, let $c_1, \dots, c_n$ be the squared L2
norms of the columns of matrix $WQ^T$. If the output of
Program 1 is $u_1, \dots, u_n$ then setting $\Lambda = (\sqrt{u_1}, \dots, \sqrt{u_n})$
achieves OptStrat_Q(W).
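
The following short derivation is a sketch of where the linear objective comes from. It assumes (for illustration only) that the design queries are the rows of an orthogonal matrix Q, as is the case for the eigen-queries used later, and it writes $u_i = \lambda_i^2$.

```latex
\begin{align*}
A &= \Lambda Q, \qquad \Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_n), \qquad QQ^T = I,\\
(A^TA)^{-1} &= (Q^T\Lambda^2 Q)^{-1} = Q^T\Lambda^{-2}Q,\\
\mathrm{trace}\big(W^TW(A^TA)^{-1}\big)
  &= \mathrm{trace}\big(QW^TWQ^T\Lambda^{-2}\big)
   = \sum_{i=1}^{n}\frac{\|Wq_i^T\|_2^2}{\lambda_i^2}
   = \sum_{i=1}^{n}\frac{c_i}{u_i}.
\end{align*}
```

The $2 \times 2$ semidefinite constraint on $(u_i, v_i)$ is equivalent to $u_i \ge 0$, $v_i \ge 0$ and $u_i v_i \ge 1$, so at the optimum $v_i = 1/u_i$ and the objective $\sum_i c_i v_i$ equals the trace term above. The squared L2 norm of column j of A is $\sum_i u_i q_{ij}^2$, which is linear in the $u_i$; this is what allows the sensitivity to be bounded by linear constraints.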



Algorithms for efficiently solving semidefinite programs have 
received considerable attention recently [s]. Using standard 
algorithms, Program 1 can be solved in $O(n|Q|^3)$ time. Recall
that the complexity of computing OptStrat(W) is $O(n^8)$.
Thus, Program 1 offers an efficiency improvement as long as
$|Q| = O(n^2)$. This provides a target size for selecting the
design set, which we turn to next.

3.2 Choosing the Design Queries 

The potential of the above approach depends on finding
a set of design queries, Q, that is concise (containing no
more than $n^2$, and preferably n, queries) and also expressive
(so that near-optimal solutions can be expressed as weighted
combinations of its elements).

One straightforward idea is to adopt as the design queries 
one of the proposed strategy matrices from prior work. These 
are good strategy matrices for specific workloads such as the 
set of all range queries (wavelet or hierarchical strategy) or 
sets of low order marginals (the Fourier strategy). Choosing
one of these for Q would guarantee that OptStrat_Q(W)
produces a solution that improves upon the error of using
that strategy. Unfortunately these strategies are not suf- 
ficiently expressive for workloads very different from their 
target workloads. 

Another possibility is to use the workload itself as the set 
of design queries, but there are two difficulties with this. 
First, there is no guarantee that a workload includes within 
it the components from which a high quality strategy may 
be formed, especially if the workload only contains a small 
set of queries. The workloads of all range and all predicate 
queries are in fact sufficiently expressive (e.g. both the hi- 
erarchical strategy and a strategy equivalent to wavelet can 
be constructed by applying weights to the set of all range 
queries). But this leads to the second issue: these work- 
loads, and others that serve important applications, are too 
large and fail to meet our conciseness requirement. 

To avoid these pitfalls, we will derive the design set from 
the given workload W by applying tools of spectral analysis. 
Intuitively this is a good choice because the eigenvectors of 
a matrix often capture its most important properties. We 
will also show in the next section that this choice aids in 
the theoretical analysis of the approximation ratio because 
it allows us to relate the output of OptStrat_Q(W) to a lower
bound on error that is a function of the workload eigenvalues. 

Recall that the key part of the expression for workload error in
Prop. 4 is $\mathrm{trace}(W^TW(A^TA)^{-1})$, and notice that the workload
occurs only in the form of $W^TW$. It follows that there
are many workloads with equivalent error because it is easy
to construct a matrix $W_0$ such that $W_0^TW_0 = W^TW$ by letting
$W_0 = QW$ for any orthogonal matrix Q. This suggests
that, as far as workload error under the matrix mechanism
is concerned, the essential properties of the workload are reflected
by $W^TW$. This motivates the following definition of
eigen-queries of a workload, which we will use as our design
set.
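
The invariance under orthogonal transformations can be checked numerically. The sketch below uses a small made-up workload and an arbitrary rotation as the orthogonal matrix; the helper names (`matmul`, `transpose`, `gram`) are illustrative and not part of any library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def gram(w):
    """Return W^T W, the only form in which the workload enters the error."""
    return matmul(transpose(w), w)

# Hypothetical workload: two counting queries over three cells.
W = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]

# A 2 x 2 rotation is orthogonal; the angle is arbitrary.
t = 0.7
Q = [[math.cos(t), -math.sin(t)],
     [math.sin(t), math.cos(t)]]

W0 = matmul(Q, W)  # a different-looking workload with the same W^T W

same = all(math.isclose(x, y, abs_tol=1e-12)
           for r0, r1 in zip(gram(W0), gram(W))
           for x, y in zip(r0, r1))
print(same)
```

The rows of `W0` are no longer 0/1 counting queries, yet any strategy yields the same workload error for `W` and `W0`.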

Definition 6 (Eigen-queries of a workload).
Given a workload W, consider the eigen-decomposition of
$W^TW$ into $W^TW = Q^TDQ$, where Q is an orthogonal
matrix and D is a diagonal matrix. The eigen-queries of
W are the rows of Q (i.e. the eigenvectors of $W^TW$).

Choosing the eigen-queries of W as the design set meets
our conciseness requirement because there are never more
than n eigen-queries. Thus Program 1, OptStrat_Q(W), has
complexity $O(n^4)$, which is $O(n^4)$ times faster than solving
OptStrat(W). We also find that the eigen-queries meet our
expressiveness objective. We will show this next by proving
a bound on the approximation ratio. In Sec. 4 we propose
techniques that exploit the fact that using subsets of the
eigen-queries retains much of the expressiveness and increases
efficiency. And in Section 5 we show experimentally that
weighted eigen-queries allow for near-optimal strategies, and
also that the eigen-queries outperform other natural alternatives
for the design set.



3.3 The Eigen-Design Algorithm 

It remains to define the complete Eigen-Design algorithm,
which is Program 2.

Program 2 The Eigen-Design Algorithm
Input: Workload matrix W.
Output: Strategy matrix A.

1: Compute the eigenvalue decomposition $W^TW = Q^TDQ$,
   where $D = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$, and use the rows of Q as the design queries.
2: Compute weights $\lambda_1, \dots, \lambda_n$ by solving Program 1 for the above
   Q and with $c_i = \sigma_i$, $i \in [1..n]$.
3: Construct matrix $A' = \Lambda Q$ where $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.
4: Let $m_1, \dots, m_n$ be the L2 norms of the columns of $A'$ and define
   $D' = \mathrm{diag}\big(\sqrt{\max_i\{m_i^2\} - m_1^2}, \dots, \sqrt{\max_i\{m_i^2\} - m_n^2}\big)$.
5: Return $A = \begin{bmatrix} A' \\ D' \end{bmatrix}$.



The algorithm performs the decomposition of $W^TW$ to
derive the design queries (Step 1), and solves OptStrat_Q(W)
using the eigen-queries as the design set (Step 2). The matrix
$A'$ that is constructed in Step 3 is a candidate strategy
but may have one or more columns whose norm is less than
the sensitivity. In this case, it is possible to add queries,
completing columns, without raising the sensitivity (Steps 4
and 5). These additional queries can only provide more information
about the database, and hence reduce error.
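
The column-completion idea can be sketched in a few lines. The sketch below assumes the weights are already available (it does not solve the optimization) and reads the completion step as adding, for each column, one single-cell query scaled so that the column's L2 norm rises to the current maximum. The function name and the example values are hypothetical.

```python
import math

def complete_columns(q_rows, weights):
    """Weight the design queries, then pad every column up to the largest
    column norm by appending a diagonal block of extra queries.

    q_rows:  design queries as a list of rows.
    weights: one weight per design query.
    Returns (A', diagonal of D', stacked strategy A).
    """
    a_prime = [[lam * x for x in row] for lam, row in zip(weights, q_rows)]
    n = len(a_prime[0])
    sq_norms = [sum(row[j] ** 2 for row in a_prime) for j in range(n)]
    top = max(sq_norms)
    # max(0.0, ...) guards against tiny negative values caused by rounding.
    d_diag = [math.sqrt(max(0.0, top - s)) for s in sq_norms]
    d_rows = [[d_diag[j] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return a_prime, d_diag, a_prime + d_rows

# Hypothetical orthogonal design queries and weights.
Q = [[0.6, 0.8],
     [-0.8, 0.6]]
weights = [2.0, 1.0]

a_prime, d_diag, strategy = complete_columns(Q, weights)
col_sq = [sum(row[j] ** 2 for row in strategy) for j in range(2)]
print(d_diag)
print(col_sq)
```

After completion every column of the strategy has the same L2 norm, so the sensitivity is unchanged while the added rows supply extra measurements. Rows of the diagonal block that are entirely zero carry no information and can be dropped.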

3.4 Analysis of the Eigen-Design Algorithm 

We now consider the accuracy and generality of the eigen- 
design algorithm, showing a bound on the worst-case approx- 
imation rate and that the accuracy of the algorithm is robust 
with respect to the representation of the input workload. 

Approximation Rate 

To bound the approximation rate, we use an existing result
showing a lower bound on the optimal error achievable for a
workload using the $(\epsilon, \delta)$-matrix mechanism [15]. The existence
of this bound does not imply an algorithm for achieving
it, but it is a useful tool for understanding theoretically
and experimentally the quality of the strategies produced by
OptStrat_Q(W) using the eigenvalues of W.

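Theorem 2. (Singular Value Bound [15]) Given
any $m \times n$ workload W, let $\sigma_1, \dots, \sigma_n$ be the eigenvalues
of matrix $W^TW$. The singular value bound of W is
$\mathrm{SVDB}(W) = \frac{1}{n}(\sqrt{\sigma_1} + \dots + \sqrt{\sigma_n})^2$, and it bounds OptError(W):

$$\mathrm{OptError}(W) \ge \sqrt{P(\epsilon, \delta)\,\mathrm{SVDB}(W)}.$$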

Intuitively, let $A_l$ be the strategy that is defined by weighting
the eigen-queries of W by $\sqrt{\sigma_1}, \dots, \sqrt{\sigma_n}$. The singular
value bound comes from underestimating the sensitivity of
$A_l$ using $\sqrt{\mathrm{trace}(A_l^TA_l)/n}$. In practice, the singular
value bound may not be achieved since there is a gap between
the sensitivity of $A_l$ and $\sqrt{\mathrm{trace}(A_l^TA_l)/n}$, but the idea
of weighting the eigen-queries can be combined with the experimental
design method to find good strategies for W.
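
The role of the sensitivity underestimate can be made explicit for strategies built from weighted eigen-queries. The derivation below is a sketch restricted to that family (the cited theorem covers arbitrary strategies); it assumes $A = \Lambda Q$ with Q the orthogonal matrix of eigen-queries and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.

```latex
\begin{align*}
\|A\|_2^2 = \max_j \sum_i \lambda_i^2 q_{ij}^2
  &\ge \frac{1}{n}\sum_j\sum_i \lambda_i^2 q_{ij}^2
   = \frac{1}{n}\sum_i \lambda_i^2
   = \frac{\mathrm{trace}(A^TA)}{n},\\
\mathrm{trace}\big(W^TW(A^TA)^{-1}\big) &= \sum_i \frac{\sigma_i}{\lambda_i^2},\\
\|A\|_2^2\;\mathrm{trace}\big(W^TW(A^TA)^{-1}\big)
  &\ge \frac{1}{n}\Big(\sum_i \lambda_i^2\Big)\Big(\sum_i \frac{\sigma_i}{\lambda_i^2}\Big)
   \ge \frac{1}{n}\Big(\sum_i \sqrt{\sigma_i}\Big)^2 = \mathrm{SVDB}(W).
\end{align*}
```

The first inequality replaces the maximum squared column norm by the average, and the last step is the Cauchy-Schwarz inequality. The gap between the maximum and the average column norm is the reason the bound need not be attained.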

Notice the strategy $A_l$ is contained in the possible solutions
of Program 2. Thus the approximation ratio of Program 2
can be estimated by using the approximation ratio
of the singular value bound.

Theorem 3. Given workload W, let $\sigma_1$ be the largest
eigenvalue of $W^TW$. Program 2 gives a strategy that approximates
OptError(W) with a ratio of $(n\sigma_1/\mathrm{SVDB}(W))^{1/4}$.



This theorem shows that the approximation ratio of applying
Program 2 to a workload W can be bounded by analyzing
the eigenvalues of matrix $W^TW$.

In practice, the ratio between the error of the eigen-strategies
and the optimal error is much smaller for a wide range
of common workloads. In the experiments in Sec. 5, the
largest ratio is at most 1.3 and in a number of cases the
ratio is essentially equal to 1, modulo numerical imprecision.

Representation Independence 

We say that the Eigen-Design algorithm is representation 
independent because its output is invariant for semantically 
equivalent workloads and error equivalent workloads. Recall 
that the logical semantics of a workload matrix W depends 
on its cell conditions. For any workload matrix W, reordering
its cell conditions leads to a new matrix W' with accordingly
reordered columns. In this case, we say W and W' are
semantically equivalent.

Naturally, we hope for a mechanism with equal error for 
any two semantically-equivalent representations of a work- 
load. Some prior approaches do not have this property. 
For example, the wavelet and hierarchical strategies exploit 
the locality present in the canonical representation of range 
queries. An alternative matrix representation of the range 
queries may result in significantly larger error. The Eigen- 
Design algorithm does not suffer from this pitfall: 

Proposition 5 (Semantic equivalence). Let $W_1$
and $W_2$ be two semantically-equivalent workloads and suppose
Program 2 computes strategy $A_1$ on workload $W_1$ and $A_2$
on workload $W_2$. Then $\mathrm{Error}_{A_1}(W_1) = \mathrm{Error}_{A_2}(W_2)$.
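
A small check illustrates why reordering cell conditions is harmless for an approach based on $W^TW$. The sketch below uses a made-up workload and permutation; it verifies that permuting the columns of W permutes the rows and columns of $W^TW$ in the same way, so the eigenvalues are unchanged and the eigenvectors are simply reordered entry-wise.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

# Hypothetical workload: three range queries over four cells.
W = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]

# New column j holds old column perm[j] (an arbitrary reordering of the cells).
perm = [2, 0, 3, 1]
W_perm = [[row[p] for p in perm] for row in W]

G = matmul(transpose(W), W)
G_perm = matmul(transpose(W_perm), W_perm)

# Applying the same permutation to the rows and columns of G gives P^T G P.
G_expected = [[G[pi][pj] for pj in perm] for pi in perm]
print(G_perm == G_expected)
```

Because `G_perm` is a symmetric permutation of `G`, the two matrices are similar and share the same spectrum, which is the property the eigen-queries rely on.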

A related issue arises for two workloads that may be se- 
mantically different, but can be shown to have equivalent 
error. Since W appears as $W^TW$ in the expression for error
of a workload, it follows that, for any orthogonal matrix
Q, workload QW has error equal to W under any strategy.
And in particular, any two such workloads have equal min- 
imum error. The Eigen-Design algorithm always finds the 
same strategies for any two error-equivalent workloads: 

Proposition 6 (Error equivalence). Let $W_1$ and
$W_2$ be two error-equivalent workloads (i.e. $W_1 = QW_2$
for some orthogonal Q) and suppose Program 2 computes
strategy $A_1$ on workload $W_1$ and $A_2$ on workload $W_2$. Then
$\mathrm{Error}_{A_1}(W_1) = \mathrm{Error}_{A_2}(W_2)$.

This result follows from the fact that the input to Program 1
uses the eigenvectors of $W^TW$, and therefore operates
identically on equivalent workloads.

Optimizing for Relative Error 

The discussion above is about workload error, an absolute 
measure of error. Our adaptive approach can also be used 
to find strategies offering low relative error. However, these 
are two fundamentally different optimization objectives and 
a single strategy matrix will not, in general, satisfy both. 

One major difference between computing absolute error 
and relative error is the impact of the L2 norm of a query 
vector. According to Prop. [S] and Def. [5j the query error 
of w under strategy A is proportional to the L2 norm of 
w. Therefore a scaled query kw has k times larger query
error compared with w, and thus a query with higher L2
norm contributes more to workload error. But because the
relative error does not change with the L2 norm of the query,
using strategies optimized for workload error will not lead to
optimal relative error.

Because the matrix mechanism is a data-independent mech- 
anism, it is not possible to optimize for relative error directly. 
If the distribution of the target dataset were known, we could 
scale each query by its weighted L2 norm, where the weight 
on each cell is proportional to the inverse of its probability. 
This scaling will optimize towards relative error by neutral- 
izing the fact that the designed strategies are biased towards 
high norm queries. Since the underlying distribution is typ- 
ically unknown, we introduce a heuristic scaling, prior to 
applying the Eigen-Design algorithm, in which each query 
is normalized to make its L2 norm 1. This is equivalent to 
assuming a uniform distribution over the cells. In Sec. 5
we show that, for two real datasets, this approach results in 
significantly lower relative error than competing techniques. 
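
The heuristic scaling is simple to state in code. The following sketch rescales every query (row) of a workload to unit L2 norm before strategy selection; the function name and the example workload are illustrative assumptions.

```python
import math

def normalize_rows(workload):
    """Scale each query so that its L2 norm is 1 (all-zero queries are kept as is)."""
    scaled = []
    for q in workload:
        norm = math.sqrt(sum(x * x for x in q))
        scaled.append([x / norm for x in q] if norm > 0 else list(q))
    return scaled

# Hypothetical workload: a total query, a single-cell query, and a scaled query.
W = [[1, 1, 1, 1],
     [0, 0, 0, 1],
     [3, 0, 0, 0]]

W_scaled = normalize_rows(W)
print(W_scaled)
```

After scaling, the total query no longer dominates the objective merely because it touches more cells. The relative error itself would still be evaluated on the original, unscaled workload.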

3.5 Application to the $\epsilon$-Matrix Mechanism

There are a number of challenges to applying the optimally
weighted design approach under $\epsilon$-differential privacy.
Recall, once again, the formula for workload error from Prop.
4: $\|A\|_2\sqrt{\mathrm{trace}(W^TW(A^TA)^{-1})}$. To move to $\epsilon$-differential
privacy, only the sensitivity term changes, from L2 to L1:
$\|A\|_1\sqrt{\mathrm{trace}(W^TW(A^TA)^{-1})}$. In the former case, the sensitivity
term $\|A\|_2$ is uniquely determined by $A^TA$. But
in the latter case, computing a near-optimal $A^TA$ is not
enough, because $\|A\|_1$ remains undetermined and is itself
hard to optimize. As a result, it is more challenging (although
still possible) to represent the optimal query weighting
as a convex optimization problem. We omit its formal
encoding, but note that the resulting problem is also less efficient
because we can no longer rely on second order cone
programming.
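
A two-query example shows why $A^TA$ does not pin down the L1 sensitivity. The sketch below compares the identity strategy with a rotated copy of it; both have $A^TA = I$, and hence the same L2 sensitivity, but their L1 sensitivities differ. Function names are illustrative.

```python
import math

def l1_sensitivity(a):
    """Largest L1 norm of a column of the strategy matrix."""
    return max(sum(abs(x) for x in col) for col in zip(*a))

def l2_sensitivity(a):
    """Largest L2 norm of a column of the strategy matrix."""
    return max(math.sqrt(sum(x * x for x in col)) for col in zip(*a))

def gram(a):
    """Return A^T A for a matrix given as a list of rows."""
    cols = list(zip(*a))
    return [[sum(x * y for x, y in zip(c1, c2)) for c2 in cols] for c1 in cols]

identity = [[1.0, 0.0],
            [0.0, 1.0]]
r = 1 / math.sqrt(2)
rotated = [[r, r],
           [r, -r]]  # orthogonal, so A^T A is again the identity

print(gram(rotated))
print(l2_sensitivity(identity), l2_sensitivity(rotated))
print(l1_sensitivity(identity), l1_sensitivity(rotated))
```

Under L2 the two strategies are interchangeable, while under L1 the rotated one is worse by a factor of $\sqrt{2}$, so an optimizer for the $\epsilon$ case has to choose among matrices that share the same $A^TA$.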

Furthermore, there does not seem to be a universally good 
design set: the eigen-queries do not outperform other bases, 
in general, because they characterize only the properties 
of $W^TW$ but do not account for the L1 sensitivity. We
can nevertheless still use our algorithm to improve existing 
strategies. For example, using the Wavelet basis in the algo- 
rithm can improve its performance on all range and random 
range queries by a factor of 1.1 and 1.5, respectively; using 
the Fourier basis can improve its performance on low order 
marginals by a factor of 1.6. 

Lastly, we do not know of an analogue of Thm. 2 providing
a guaranteed error bound for the $\epsilon$-Matrix Mechanism to
verify the quality of the output.

These challenges motivate our choice to focus on
$(\epsilon, \delta)$-differential privacy. While the two privacy guarantees are
strictly speaking incomparable, for conservative settings of
$\delta$, a user may be indifferent between the two. It is then
possible to show that the asymptotic error rates for many 
workloads are roughly comparable between the two models. 

4. COMPLEXITY AND OPTIMIZATIONS 

We focus next on methods to further reduce the complexity
of approximate strategy selection. We first analyze the
complexity of the strategy selection algorithm and show that
it can be solved more efficiently for low rank workloads, with
no impact on the quality of the solution. Then we propose
two approaches which can significantly speed up strategy
selection by reducing the size of the input to Program 2. Intuitively,
both approaches perform strategy selection over a
summary of the workload that is constructed from its most
significant eigenvectors, potentially sacrificing fidelity of the
solution. We evaluate the latter two techniques in Sec. 5.2.

4.1 Complexity Analysis 

The rank of workload matrix W, denoted by rank(W), 
is the size of the largest linearly-independent subset of the 
rows (or, equivalently, columns). When rank(W) is its max- 
imum value, n, we say that W has full rank, which im- 
plies that accurate answers to the workload queries in W 
uniquely determine every cell count in x. The complexity 
of the strategy selection algorithm can be broken into three 
parts: computing the eigenvectors and eigenvalues of matrix 
$W^TW$, solving the optimization problem, and constructing
the strategy. If an eigenvalue is equal to zero, the eigenvalue
and its corresponding eigenvectors are not actually involved
in the optimization and strategy construction, so they can be
omitted in practice. Since the number of nonzero eigenvalues
of $W^TW$ is equal to rank(W), the complexity of Program 2
is $O(nm\,\mathrm{rank}(W) + n\,\mathrm{rank}(W)^3)$.

The complexity analysis above indicates that efficiency
can be significantly improved when $\mathrm{rank}(W) \ll n$. For example,
the rank of low order marginal workloads can be
bounded by the number of queries in the workload. Suppose
a low-order marginal workload is defined on a k-dimensional
space of cell conditions, each of which has size d. If the workload
only contains one-way marginals, the complexity of solving
Program 2 over this workload is bounded by $O(k^3 d^{k+3})$.
If the workload consists of one and two-way marginals the
complexity is $O(k^6 d^{k+6})$. Both of these bounds are much
smaller than $\Theta(d^{4k})$.
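
To get a feel for the size of the savings, the arithmetic can be worked through for a concrete domain. The sketch below assumes a cost model in which the dominant term is n times the cube of the rank (constants ignored), with $n = d^k$ cells, and bounds the rank by the number of marginal queries; the values of k and d are hypothetical.

```python
def program_cost(n, rank):
    """Assumed dominant cost term n * rank^3, ignoring constants."""
    return n * rank ** 3

k, d = 4, 10   # hypothetical: 4 attributes with 10 values each
n = d ** k     # number of cells

one_way_rank = k * d                                        # k one-way marginals, d queries each
one_two_way_rank = one_way_rank + (k * (k - 1) // 2) * d * d  # plus C(k,2) two-way marginals

cost_one_way = program_cost(n, one_way_rank)
cost_one_two_way = program_cost(n, one_two_way_rank)
cost_full_rank = program_cost(n, n)

print(cost_one_way, cost_one_two_way, cost_full_rank)
```

For this domain the one-way bound is more than seven orders of magnitude below the full-rank cost, and even the one- and two-way workload stays more than three orders of magnitude below it.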

4.2 Workload Reduction Approaches 

Next we propose two approaches which allow us to re- 
duce the number of variables in the optimization problem. 
Both are inspired by principal component analysis (PCA),
in which a matrix is characterized by the so-called principal 
eigenvectors, which are the eigenvectors associated with the 
largest eigenvalues. 

In our case, recall that we cannot ignore the non-principal 
eigenvectors since the rank of the strategy matrix A cannot 
be lower than the workload matrix W. Instead, we either 
compute separately the weights for the principal and remain- 
ing eigenvectors, or we choose the same weights for all the 
remaining eigenvectors. 

Eigen-Query Separation 

In eigen-query separation, we partition the eigen-queries into 
groups of a specified size according to their corresponding 
eigenvalues. Treating one group at a time, Program 1 is executed
to determine the optimal weights just for the eigenvectors
of that group. After the individual group optimizations
are finished, another optimization can be used to calculate 
the best factor to be applied to all queries in each group. If 
the group size is large, all of the principal eigenvectors may 
be contained in one group, in which case the most important 
weights will be computed precisely. 
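
The partitioning step can be sketched as follows. The text does not spell out the grouping rule beyond grouping by eigenvalue, so the sketch assumes the simplest reading: sort the eigen-queries by descending eigenvalue and cut the sorted list into consecutive groups of the requested size. The function name and the eigenvalues are made up.

```python
def group_by_eigenvalue(eigenvalues, group_size):
    """Return groups of eigen-query indices, largest eigenvalues first."""
    order = sorted(range(len(eigenvalues)),
                   key=lambda i: eigenvalues[i], reverse=True)
    return [order[s:s + group_size] for s in range(0, len(order), group_size)]

# Hypothetical eigenvalues of W^T W.
sigma = [0.5, 9.0, 2.0, 7.5, 0.1, 4.0, 1.0]

groups = group_by_eigenvalue(sigma, 3)
print(groups)
```

With this ordering the principal eigenvectors land together in the first group, so their weights are optimized jointly; a second, smaller optimization would then choose one scaling factor per group.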

The complexity of eigen-query separation depends on the 
group division. Notice that during the optimization of each 
group, the convex optimization problem is equivalent to setting
all eigenvalues of excluded eigenvectors to zero. Analogous
to the discussion of low rank workloads, letting the size
of each group be $n_g$, the complexity of solving the optimization
problem over each group is $O(n\,n_g^3)$. Similarly, the time complexity
to combine all the groups is $O(n(n/n_g)^3)$, and therefore
$O(n^2 n_g^3 + n(n/n_g)^3)$ in total. Asymptotically, the complexity
of eigen-query separation is minimized when $n_g = O(n^{1/3})$.
Then the complexity of the entire process is $O(n^3)$,
the same as the cost of standard matrix multiplication.

Principal Vectors Optimization 

In the principal vector optimization we use a subset of the 
k most important eigenvectors as the design set, computing 
the optimal weights as usual. Instead of ignoring the less 
important eigenvectors (as is typical in PCA) we simply use 
a single common weight for each of the excluded vectors that 
have non-zero eigenvalues. The number of variables in the
convex optimization is reduced to k + 1 so that the time
complexity is reduced to $O(nk^3)$. Experimentally we find
that good results are possible with as few as 10% of the
eigenvectors.
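
One way to set up the reduced problem is sketched below. It assumes that the weighting problem minimizes a sum of cost over squared weight, subject to bounds on squared column norms, so that tying the excluded eigen-queries to one shared weight merges their costs and their squared entries into a single extra variable. The function name and the tiny example are hypothetical, and the optimization itself is not solved here.

```python
def reduce_to_principal(eigenvalues, q_rows, k):
    """Build the data for a (k + 1)-variable weighting problem.

    The k eigen-queries with the largest eigenvalues keep their own variable;
    the remaining eigen-queries with non-zero eigenvalue share one variable.
    Returns (costs, sq_rows): one cost and one row of squared entries per variable.
    """
    order = sorted(range(len(eigenvalues)),
                   key=lambda i: eigenvalues[i], reverse=True)
    top = order[:k]
    rest = [i for i in order[k:] if eigenvalues[i] > 0]
    n = len(q_rows[0])
    costs = [eigenvalues[i] for i in top] + [sum(eigenvalues[i] for i in rest)]
    sq_rows = [[x * x for x in q_rows[i]] for i in top]
    sq_rows.append([sum(q_rows[i][j] ** 2 for i in rest) for j in range(n)])
    return costs, sq_rows

# Hypothetical eigenvalues, with the identity standing in for the eigen-queries.
sigma = [4.0, 1.0, 0.25, 0.0]
Q = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

costs, sq_rows = reduce_to_principal(sigma, Q, k=1)
print(costs)
print(sq_rows)
```

Here one principal eigen-query keeps its own variable, the two remaining eigen-queries with non-zero eigenvalue share the second variable, and the eigen-query with zero eigenvalue is dropped.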



In Sec. 5.2 we show that both of the above approaches
can improve execution time by two orders of magnitude with 
modest impact on solution quality. Extending our theoretical 
bound on the approximation rate to these approaches is an 
interesting direction for future work. 

5. EXPERIMENTAL EVALUATION 

The empirical evaluation of our mechanism has three objectives:
(i.) to measure solution quality of the Eigen-Design
algorithm using both absolute and relative error; (ii.) to measure
the trade-off between speed-up and solution quality of
our two performance optimizations; and (iii.) to measure
the effectiveness of using the eigen-queries as the design set.
Experimental conclusions are presented in Sec. 5.4.



Experimental Setup 

Recall that workload error is an absolute error measure based 
on root mean square error. Workload error can be analytically
computed using Prop. 4, and this is precisely the error
that will be witnessed when running repeated trials and
computing the mean deviation. Further, workload error is 
independent of the true counts in data vector x. That is, 
it is independent of the input data. These facts hold for all 
instances of the matrix mechanism, and therefore for each 
of the competing techniques we consider below. Therefore, 
when evaluating this absolute error measure, we do not per- 
form repeated trials with samples of random noise nor do 
we use any datasets. In addition, all measures of workload 
error include the same factor $P(\epsilon, \delta)$, so that changing the
privacy parameters impacts each method with the same factor,
leaving the ratio of their error the same. Consequently,
for workload error, we simply fix $\epsilon = 0.5$ and $\delta = 0.0001$.

For workload error, all error measurements are purely a 
function of the workload, reflecting the hardness of simulta- 
neously answering a set of queries under differential privacy. 
In addition, these error rates can be compared directly with 
the lower bound in Theorem 2, reflecting a bound on the
approximation rate. (This lower bound is not known to be 
achievable for all workloads, but nevertheless informs the 
quality of the eigen-strategy and its competitors.) 

We also evaluate the relative error rates achievable using
our algorithm by computing the strategy that minimizes absolute
error on a scaled workload, as described in Sec. 3.4. Of
course, the relative error rates reported in experiments are
always for the original input workload. In these experiments
we vary the value of $\epsilon$, for a fixed $\delta = 0.0001$, and consider
two real datasets. The first dataset is the US individual census
data from the past five years, aggregated on age,
occupation and income. The second is the Adult dataset, in
which tuples are weight-aggregated on age, work, education
and income. The size and dimensions of the datasets are:



Dataset      Dimension          # Tuples
US Census    8 x 16 x 16        15M
Adult        8 x 8 x 16 x 2     33K

Table 1: The size and dimensions of the datasets

All experiments are executed on a quad-core 3.16GHz In- 
tel CPU with 8 GB memory. Our Python implementation 
extends publicly-available code for the matrix mechanism [2] 
and also uses the dsdp solver [3] in the cvxopt [1] package.

Competing Approaches 

We compare the Eigen-Design strategy with the following 
four alternatives. Although originally proposed in the context
of $\epsilon$-differential privacy, each is easily adapted to
$(\epsilon, \delta)$-differential privacy and the shift generally improves the
relationship to the optimal error rate (with the exception of the
Fourier strategy, noted below). 

Fourier is designed for workloads consisting of all k-way
marginals, for a given k. The strategy transforms
the cell counts with the Fourier transformation and
computes the marginals from the Fourier parameters.
When the workload is not full rank, the unnecessary
queries of the Fourier basis are removed from the strategy
to reduce sensitivity. The effectiveness of the Fourier
strategy is somewhat reduced under $(\epsilon, \delta)$-differential
privacy because dropping unnecessary queries results
in a smaller sensitivity reduction using L2.

DataCube is an adaptive method that supports marginal
workloads. We implemented the BMAX algorithm,
which chooses a subset of input marginals so as to minimize
the maximum error when answering the input
workload. To adapt the algorithm to $(\epsilon, \delta)$-differential
privacy, sensitivity is measured under L2 instead of L1.

Wavelet supports multi-dimensional range workloads by
applying the Haar wavelet transformation to each dimension
[21]. When using $\epsilon$-differential privacy, Xiao
et al. also introduced a hybrid algorithm that uses the
identity strategy on dimensions with small size. This
optimization is unnecessary under $(\epsilon, \delta)$-differential privacy:
the hybrid algorithm does not lead to smaller
error when sensitivity is measured under L2.

Hierarchical aims to answer workloads of range queries
using a binary tree structure of queries: the first query
is the sum of all cells and the rest of the queries recursively
divide the first query into parts [13]. We test
binary hierarchical strategies (although higher orders
are possible). The strategy in [13] supports one-dimensional
range workloads, but is adapted to multiple
dimensions in a manner analogous to Wavelet [21].
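
As a concrete illustration of one of these competitors, the binary hierarchical strategy for a one-dimensional domain can be generated directly. The sketch below assumes the number of cells is a power of two and emits one 0/1 query per tree node; the function name is illustrative.

```python
def hierarchical_strategy(n):
    """Binary hierarchical strategy over n cells (n must be a power of two).

    The first row sums all cells; each later level splits every range of the
    previous level in half, down to the individual cells.
    """
    rows = []
    width = n
    while width >= 1:
        for start in range(0, n, width):
            rows.append([1 if start <= j < start + width else 0 for j in range(n)])
        width //= 2
    return rows

A = hierarchical_strategy(4)
for row in A:
    print(row)

# Every cell is covered by exactly one query per tree level.
column_counts = [sum(col) for col in zip(*A)]
print(column_counts)
```

For n cells the strategy has 2n - 1 queries, and each cell appears in $\log_2 n + 1$ of them, so the L2 sensitivity is $\sqrt{\log_2 n + 1}$ while the L1 sensitivity is $\log_2 n + 1$.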

We do not compare with the error of the standard Gaussian
mechanism, which, for the workloads considered, is far
worse than all alternatives. Prior works [13], [21], [7] compared
the error rates of their approaches with the identity strategy.
We omit this explicit comparison, since the identity is always
within the space of possible strategies the Eigen-Design algorithm could
choose, but is not competitive.

US Census data source: Integrated Public Use Microdata Series: usa.ipums.org
Adult dataset source: UCI Machine Learning Repository: archive.ics.uci.edu/ml/

Figure 3: Absolute and relative error for the Eigen-Design algorithm and competitors, for range and marginal
workloads, on 2048 cells. "Lower Bound" is a bound on the best possible error achievable by any strategy.

5.1 Error of the Eigen-Design Algorithm 

We now measure the improvement in absolute and relative 
error offered by the Eigen-Design algorithm along with its ap- 
proximation to optimal absolute error. Below we refer to the 
strategy produced by the Eigen-Design algorithm, for a given 
workload, as the eigen-strategy. We consider three classes of
workloads, beginning with workloads of range queries, then 
workloads of marginals, and then some alternative workloads 
designed to test the adaptivity of the mechanism. 

Workloads of Range Queries. Figs. 3(a),(b) contain experiments
on workloads of all range queries and random
range queries. The random ranges are sampled with the
two-step sampling method in [21]. Here the eigen-strategies
are compared with the Hierarchical and Wavelet strategies. The
figures are in log scale, except Fig. 3(a) on all range queries.
The results show that the eigen-design strategies reduce error
by a factor of 1.2 to 2.1 in workload error and 1.3 to 1.5
in relative error compared to the best competing strategies.
In addition, for workload error, the eigen-design strategy is
within a factor of 1.3 of the lower bound.

Workloads of Marginals. Figs. 3(c),(d) contain experiments
on workloads of 2-way marginal queries and random
marginal queries, in which the random marginals are sampled 
with the sampling method in [t]. Here the eigen-strategies 
are compared with Fourier and DataCube. The figures are 
in linear scale for workload error and log scale for relative 
error. The results show that the eigen-design strategies re- 
duce error by a factor of 1.3 to 2.2 compared to the best 
competing strategies in workload error, and by a factor of 
1.1 to 2.7 in relative error. In addition, the error of the eigen-design
strategies matches the lower bound of workload error,
indicating that our algorithm found an optimal strategy with
respect to workload error.



Workload               Err Type    Error Ratio             Best/Worst
                                   Best/Worst    Bound     Competitor
1D Range (Permuted)    workload    9.62/13.16    0.99      Wav./Hier.
                       relative    1.51/2.43               Wav./Hier.
1Way Range Marginal    workload    1.30/7.69     0.98      D.Cube/Four.
                       relative    1.36/4.93               D.Cube/Four.
2Way Range Marginal    workload    1.63/3.23     0.95      Hier./Four.
                       relative    1.81/2.38               Wav./D.Cube
1D CDF                 workload    1.01/1.01     0.80      Wav./Hier.
                       relative    0.46/0.54               Wav./Hier.
Predicate              workload    1.39/1.94     1.00      Wav./Four.
                       relative    1.42/3.55               Four./Hier.

Table 2: The factor of error reduced for the Eigen-Design
algorithm w.r.t. the best/worst competitor
strategies and the theoretical bound, for alternative
workloads, on 2048 cells.

Alternative Workloads. To demonstrate that our mechanism
is adaptive over a variety of workloads, we also include
other workloads that have not been studied in prior work.
First we show that our mechanism adapts to semantically
equivalent workloads, in which we repeat the experiment on
the range workload but randomly permute the order of cell conditions.
The justification for this experiment comes from the
fact that the user may wish to answer queries in which the
order of the cell conditions is not obvious, such as predicate
queries over categorical attributes.

In addition, we run experiments on three other workloads: 
the range marginals workload, the cumulative distribution 
(CDF) workload, and uniformly sampled predicate queries. 
The range marginals workload is important because most 
data analyses using marginals do not simply use individual 
counts, but also aggregate counts. If this is the case, sim- 
ply computing the marginals workload privately is the wrong 
approach because error accumulates for aggregations. Last, 
the CDF workload is a highly-skewed set of one-dimensional 
range queries where the sensitivity in the first cell is n, de- 
creasing linearly to 1 for the last cell. 
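
The skew of the CDF workload is easy to see by constructing it. The sketch below builds the one-dimensional CDF workload as prefix-sum queries (query i counts cells 1 through i) and counts how many queries touch each cell; the function name is illustrative.

```python
def cdf_workload(n):
    """Return the n prefix-sum queries over n cells as a 0/1 matrix."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

W = cdf_workload(5)
for row in W:
    print(row)

# Number of queries that include each cell.
column_counts = [sum(col) for col in zip(*W)]
print(column_counts)
```

The first cell appears in all n queries and the last cell in only one, which is the linear decrease described above.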
Figure 4: Quality and efficiency of approximation methods on 8192 cell conditions



We summarize the experimental results on alternative workloads
in Table 2. For relative errors, due to space constraints,
we only present results on US census data with
$\epsilon = 0.5$ and $\delta = 0.0001$. We present, for each workload, the
factor of error reduction achieved by our algorithm compared 
to the best and worst competing approach, whose name is 
shown in the last column of the table. (Datacube is only con- 
sidered for range marginals and Fourier is not considered on 
permuted range and CDF.) In addition, for workload error, 
we also include the ratio to the error lower bound. 

The results show that the eigen-strategy can improve ab- 
solute error by as much as 13 times (on permuted range 
queries) and relative error as much as 5 times (on one-way 
range marginals). The workload error of competing strate- 
gies is heavily impacted by the permutation but the relative 
errors are not as bad since queries of individual cells and 
small ranges dominate the workload, which do not change 
too much under permutation. On all workloads but one, the 
eigen-strategy beats every competitor by at least a factor of 
1.3, and is very close to — or achieves — the theoretical error 
lower bound. The only exception is the CDF workload, in 
which the eigen-strategy is only a bit better than the com- 
petitor for workload error and worse (than Hierarchical and 
Wavelet) for relative error. Overall, the results for workload 
and relative error are largely similar for range marginals and 
the predicate workload. 

5.2 Performance Optimizations 

Fig. 4 illustrates the trade-off between computational
speed-up and solution quality for the eigen-separation and
principal vector performance optimizations described in Section
4. We only present results with workload errors here
(the results with relative error are similar or even better). 
Error and computation time are plotted together using two 
y-axes: the left axis measures workload error and the right 
axis measures execution time in seconds. The baselines for 
error are the lower bound and the best competing technique. 

The running time of the standard Eigen-Design algorithm
can be estimated from the running time of the principal
vector method; it is more than an order of magnitude slower
than the principal vector method with 25% of the eigenvectors.
Compared with this estimated time, both methods can reduce
the running time by two orders of magnitude while the error
they introduce is less than 12% over the lower bound. For
the eigen-separation method, the computation in each group
takes more time with larger group sizes, while the
computation of merging groups takes more time with smaller
group sizes. Theoretically, the best choice of group size
for the eigen-separation method is n^{1/3}, which is closest
to 16 in this case. Using eigen-query separation with a
group size of 16, the error is 5% higher on all range
queries and 11% higher on all marginal queries. Using the
principal vectors optimization with 6% of the eigenvectors,
the error is 10% higher on all range queries and the same as
the optimal on all marginal queries.
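
As a quick arithmetic check of the group-size choice, the sketch below assumes that the theoretical optimum is the cube root of n, that n = 2048 cells, and that candidate group sizes are powers of two. All three are assumptions of this sketch, as are the names `ideal` and `best_group_size`.

```python
# Assumptions (not stated in the text): n = 2048 cells, the ideal group
# size is n^(1/3), and only power-of-two group sizes are considered.
n = 2048
ideal = n ** (1.0 / 3.0)                      # roughly 12.7
candidates = [2 ** k for k in range(1, 12)]   # 2, 4, ..., 2048

# Pick the candidate with the smallest absolute distance to the ideal size.
best_group_size = min(candidates, key=lambda g: abs(g - ideal))
print(ideal, best_group_size)
```

Under these assumptions the nearest candidates are 8 and 16, and 16 is the closer of the two, which is consistent with the group size used in the experiment.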

According to the results, the eigen-separation performs 
better on range queries while the principal vectors method 
is better on marginals. In either case, the performance im- 
provements still produce results that are significantly better 
than competing techniques. 

5.3 The Choice of Design Queries 

To evaluate our claim from Section 3.2 that the eigen-queries
are an effective choice for the design queries, we compare
strategies computed by Program 1 using the eigen-queries,
the Wavelet matrix, and the Fourier matrix as the design
queries. Since using the eigen-queries introduces the same
error for semantically equivalent workloads, we also check
empirically whether this property holds for the other sets
of design queries. Fig. 5 shows the results of these
comparisons over two structured workloads considered above,
as well as the same workloads with the order of the cell
conditions permuted.




[Figure 5 plot not reproduced. Recoverable labels: 1D Range on [2048]; 2D Marginal on [64 x 32] (Permuted); Lower Bound.]

Figure 5: Comparison of design queries

The results show that using the Fourier or the Wavelet
strategy as the set of design queries introduces 20% more
error on all one-dimensional range queries and achieves
the same error on two-way marginals. However, these design
queries cannot maintain their performance for workloads
represented under a permutation of the cell conditions: they
are worse than the eigen-queries by more than a factor of 4
on the permuted one-dimensional range queries.
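
The sensitivity of fixed design queries to cell order can be contrasted with the eigen-queries using a small numerical sketch. It assumes the eigen-queries are the eigenvectors of W^T W for a workload matrix W, uses a small domain (n = 16) for speed, and introduces the hypothetical helpers `all_range_workload` and `eigen_queries`. It only checks that permuting the cells leaves the spectrum unchanged and permutes the eigenvectors in the same way; it does not reproduce the error figures reported above.

```python
import numpy as np

def all_range_workload(n):
    """Rows are indicator vectors of every interval [i, j] over n cells."""
    rows = []
    for i in range(n):
        for j in range(i, n):
            row = np.zeros(n)
            row[i:j + 1] = 1.0
            rows.append(row)
    return np.array(rows)

def eigen_queries(W):
    """Eigenvalues (descending) and eigenvectors (as rows) of W^T W."""
    vals, vecs = np.linalg.eigh(W.T @ W)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order].T

n = 16
W = all_range_workload(n)
perm = np.random.default_rng(0).permutation(n)
W_perm = W[:, perm]                  # same queries, cells listed in a new order

vals, Q = eigen_queries(W)
vals_perm, _ = eigen_queries(W_perm)

# The spectrum does not depend on the cell order.
spectrum_matches = np.allclose(vals, vals_perm)

# Permuting the entries of each original eigenvector yields an eigenvector
# of the permuted workload with the same eigenvalue.
M_perm = W_perm.T @ W_perm
Q_moved = Q[:, perm]
eigvecs_permute = np.allclose(M_perm @ Q_moved.T, Q_moved.T * vals)

print(spectrum_matches, eigvecs_permute)
```

A fixed basis such as the Wavelet or Fourier matrix has no such property: it stays the same when the cells are reordered, while the workload it must support changes.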

5.4 Experimental Conclusions 

The experimental results show that, for the workloads
specifically targeted by competing techniques, those
techniques achieve error that is not too far from optimal
(usually a factor of about 1.2 to 3.4 times the lower bound
on error). But for broader classes of workloads, or ad hoc
subsets of structured workloads, existing techniques are
limited and the adaptivity of the Eigen-Design can improve
relative or absolute error by a larger factor. We have
confirmed the versatility of our algorithm, as it improves
on all competing techniques for virtually every workload
considered. The one exception is the highly skewed CDF
workload. The lowest error strategy we are aware of for this
workload is produced by our design algorithm, but with an
alternative basis.

6. RELATED WORK

The present work uses the framework of the matrix mechanism
to develop an adaptive query answering algorithm. The
original work on the matrix mechanism [14] described and
analyzed in a unified framework two prior techniques
specifically tailored to range queries. The first used a
wavelet transformation [21]; the second used a hierarchical
set of queries followed by inference. Originally, the matrix
mechanism focused mainly on ε-differential privacy, although
(ε, δ)-differential privacy was also considered briefly. Prior
work on the matrix mechanism never considered strategies
beyond those proposed in the previous literature, or natural
candidates like the identity matrix. The convex optimization
formalization in prior work only runs on small n (n < 64)
and cannot be used in practice.

Low-order marginals are studied in [4] using the Fourier
transformation. The authors also consider enforcing integral
consistency on the output, an objective we do not consider
here. Recently, Ding et al. proposed an adaptive algorithm to
answer workloads consisting of data cube queries [7], which
(described in our terms) considers strategies composed only
of individual marginal queries and optimizes the workload
error approximately. The algorithm adapts a known
approximation algorithm for the subset-sum problem and cannot
be applied to general linear queries. Most of these
techniques focus on ε-differential privacy; however, they are
actually more effective under (ε, δ)-differential privacy, so
comparisons with our algorithms are meaningful.

The error rates of the matrix mechanism are independent
of the database instance. Recently, a number of
data-dependent algorithms for answering linear queries under
differential privacy have been proposed. Xiao et al. [22]
propose a method for computing a strategy matrix using
KD-trees, and Cormode et al. [6] propose a related method in
which a differentially-private median computation is used to
guide hierarchical range queries. While promising, these
approaches appear to restrict the strategy to hierarchical
structures, which we have shown are suboptimal for many
workloads. Dynamic strategy selection can also increase
computation cost. These tradeoffs deserve further investigation.
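
The data independence of the matrix mechanism can be illustrated with a short sketch. It assumes a full-column-rank strategy matrix A, least-squares inference via the pseudoinverse, and independent Gaussian noise of standard deviation `sigma` added to each strategy answer. The calibration of `sigma` to the privacy parameters is not modeled, and the function names, the toy workload, and the toy strategy are all hypothetical.

```python
import numpy as np

def workload_answers(W, A, x, noise):
    """Answer workload W from noisy answers to strategy A (full column rank)."""
    noisy_strategy = A @ x + noise               # perturbed strategy answers
    x_hat = np.linalg.pinv(A) @ noisy_strategy   # least-squares estimate of x
    return W @ x_hat

def expected_total_squared_error(W, A, sigma):
    """E||W x_hat - W x||^2 for i.i.d. N(0, sigma^2) noise; x does not appear."""
    return sigma ** 2 * np.linalg.norm(W @ np.linalg.pinv(A), "fro") ** 2

n = 8
W = np.tril(np.ones((n, n)))                  # toy prefix-sum workload
A = np.vstack([np.eye(n), np.ones((1, n))])   # toy strategy: cells plus total
rng = np.random.default_rng(0)
sigma = 2.0
noise = rng.normal(0.0, sigma, size=A.shape[0])

x_small = np.zeros(n)
x_large = rng.integers(0, 1000, size=n).astype(float)

# With the same noise draw, the answer error is identical for both databases.
residual_small = workload_answers(W, A, x_small, noise) - W @ x_small
residual_large = workload_answers(W, A, x_large, noise) - W @ x_large
same_residual = np.allclose(residual_small, residual_large)

print(same_residual, expected_total_squared_error(W, A, sigma))
```

Because the residual equals W A^+ applied to the noise, the error depends only on the workload, the strategy, and the noise scale. A data-dependent method gives up this property in exchange for the chance to adapt to the instance.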

Focusing on relative error, Xiao et al. [20] propose a
data-dependent algorithm to minimize the relative error with
an innovative resampling function. Data-dependent interactive
(as opposed to batch) mechanisms have been considered by
Roth and Roughgarden [19], who answer predicate queries on
databases with 0-1 entries. Hardt et al. [12] provide a
linear-time algorithm for the same query and database setting.

7. CONCLUSIONS AND FUTURE WORK 

We have described an adaptive mechanism for answering 
complex workloads of counting queries under differential pri- 
vacy. The mechanism can be seen to automatically select, 
for a given workload, a noise distribution composed of lin- 
ear combinations of independent Gaussian noise. With no 
reduction in privacy, the mechanism can significantly reduce
error over competing techniques and is close to optimal with 
respect to the class of perturbation methods considered. 

In the future, we hope to extend our theoretical
approximation bounds to the eigen-separation and principal
vector optimizations, and to apply our approach to
non-linear queries.